Understanding the Hardware and the Timeframe

The biggest issue I ran into during this project was my hardware limitations. I have a reasonably capable computer, but it lacks the bandwidth, GPU power, and RAM this project needed, so I ran most of the computation on AWS instances. I used two instances: a t3a.medium with 350 GB of total volume and a g4dn.2xlarge with 260 GB of total volume (see note). Let's run through the specs.

t3a.medium - AMD processor - 2 vCPUs - 4 GB RAM - No GPU

g4dn.2xlarge - Intel processor - 8 vCPUs - 32 GB RAM - Nvidia T4 with 16 GB VRAM

*Note: GPU-powered AWS instances usually come with a fixed (non-resizable) amount of storage on an NVMe SSD. The g4dn.2xlarge comes with 225 GB.*

Project Timeline

The timeframe was the second biggest issue I ran into during the course of this project. Before I could rent the lower-cost g4 instances, I first had to request a quota increase. And training the model and preprocessing all the images took an absurdly long time. Setting aside coding and research time, here is the entire time flow of the project:

  • Downloading all the images took around 90 minutes
  • Preprocessing the images took about 30 hours
    • generating the 200 augmentations for each image took around 30-40 seconds (a rough sketch of this loop follows the list)
  • Uploading all the images to an s3 bucket took around 90 minutes
  • Downloading all the images to the new g4 instance took around 90 minutes
  • Training the model took around 13 hours
    • each epoch took around 35 minutes to train and another 10-15 minutes for validation
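
To give a sense of where those 30 hours went, here is a minimal sketch of what an augmentation loop like this might look like, assuming OpenCV and Albumentations (which I used for preprocessing). The directory names, augmentation count constant, and the specific transforms are placeholders for illustration, not my exact pipeline.

```python
import os
import cv2
import albumentations as A

# Hypothetical paths -- placeholders, not the exact layout I used.
INPUT_DIR = "cards/original"
OUTPUT_DIR = "cards/augmented"
AUGS_PER_IMAGE = 200

# A simple pipeline of randomized transforms; the real transform set
# would be tuned to how cards actually appear in photos.
transform = A.Compose([
    A.Rotate(limit=25, p=0.7),
    A.RandomBrightnessContrast(p=0.5),
    A.GaussNoise(p=0.3),
    A.Perspective(scale=(0.02, 0.08), p=0.5),
])

os.makedirs(OUTPUT_DIR, exist_ok=True)

for name in os.listdir(INPUT_DIR):
    image = cv2.imread(os.path.join(INPUT_DIR, name))
    if image is None:
        continue  # skip unreadable files
    stem, _ = os.path.splitext(name)
    # 200 randomized variants per card is what pushed the job to ~30 hours.
    for i in range(AUGS_PER_IMAGE):
        augmented = transform(image=image)["image"]
        cv2.imwrite(os.path.join(OUTPUT_DIR, f"{stem}_{i}.jpg"), augmented)
```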

In total, the wait time alone was around 48 hours. The reason I didn't do the image downloading, preprocessing, and model training all on the same instance was to save money: GPU-powered instances are far more expensive than non-GPU ones. For example, the t3a.medium instance cost around $0.04 per hour, while the g4dn.2xlarge was around $1.07 per hour when I was renting it.

Cost Considerations

When working with cloud infrastructure, especially GPU instances, cost optimization becomes crucial. Here’s a breakdown of my approach:

Why Use Two Different Instances?

The key insight was to separate compute-intensive but non-GPU tasks from GPU-accelerated tasks:

  1. CPU-only tasks (t3a.medium):
    • Downloading images from APIs
    • Image augmentation with OpenCV and Albumentations
    • Uploading processed data to S3
    • Cost: ~$0.04/hour
  2. GPU tasks (g4dn.2xlarge):
    • Model training with TensorFlow/Keras (see the sketch after this list)
    • Validation during training
    • Cost: ~$1.07/hour
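
For the GPU side, here is a minimal sketch of the kind of TensorFlow/Keras training run the g4dn.2xlarge handled. The architecture, image size, batch size, epoch count, and directory names are placeholders and assumptions, not my exact setup.

```python
import tensorflow as tf

# Hypothetical paths and hyperparameters -- placeholders for illustration.
IMG_SIZE = (224, 224)
BATCH_SIZE = 64
NUM_CLASSES = 3595  # one class per card

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=IMG_SIZE, batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=IMG_SIZE, batch_size=BATCH_SIZE)

# A small transfer-learning model as an example; the real architecture may differ.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Each epoch trains on the full set, then runs validation -- roughly the
# ~35 min train / 10-15 min validation split described above.
model.fit(train_ds, validation_data=val_ds, epochs=15)
```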

By using this split approach, I saved approximately:

  • Image preprocessing: 30 hours × ($1.07 - $0.04) = ~$30.90
  • Data transfer time: 3 hours × ($1.07 - $0.04) = ~$3.09
  • Total savings: ~$34

This might not seem like much, but for personal projects and experimentation, these costs add up quickly.

Lessons Learned

Storage Management

Working with 110 GB of augmented images taught me several important lessons:

  1. S3 as intermediate storage - Using S3 to transfer data between instances was essential (a minimal transfer sketch follows this list)
  2. EBS volume sizing - I had to carefully plan storage capacity for each instance
  3. Cleanup strategy - Deleting intermediate files became important to manage costs
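
The S3 hand-off itself can be fairly simple. Below is a sketch using boto3 with a hypothetical bucket name and helper functions I'm inventing for illustration; in practice the AWS CLI's `aws s3 sync` is another option for large directories.

```python
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-card-images"  # hypothetical bucket name

def upload_dir(local_dir: str, prefix: str) -> None:
    """Upload every file under local_dir to s3://BUCKET/prefix/..."""
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.join(prefix, os.path.relpath(path, local_dir))
            s3.upload_file(path, BUCKET, key)

def download_prefix(prefix: str, local_dir: str) -> None:
    """Download every object under s3://BUCKET/prefix/ into local_dir."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = os.path.join(local_dir, os.path.relpath(obj["Key"], prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)

# On the t3a.medium:   upload_dir("cards/augmented", "augmented")
# On the g4dn.2xlarge: download_prefix("augmented", "data/train")
```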

Performance Bottlenecks

The biggest time sinks were:

  1. Data augmentation - 30 hours for processing 3,595 cards × 200 augmentations each
  2. Data transfer - Moving 110 GB between instances and S3 took several hours
  3. Training time - Each epoch took 45+ minutes, which meant I had to be strategic about hyperparameter tuning (see the callback sketch after this list)
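
With epochs that slow, wasted passes are expensive. One way to hedge against that, which I'm offering as a general Keras technique rather than something my original run necessarily used, is early stopping plus checkpointing so a bad configuration costs a few epochs instead of the full 13 hours:

```python
import tensorflow as tf

# Stop when validation loss stops improving, and keep only the best weights.
# The checkpoint path is a placeholder.
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint(
        "checkpoints/best.keras", monitor="val_loss", save_best_only=True),
]

# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=callbacks)
```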

What I’d Do Differently

If I were to redo this project, I would:

  1. Start with fewer augmentations - Maybe 50-100 per image to iterate faster
  2. Use spot instances - Could save 50-70% on GPU costs (a boto3 sketch follows this list)
  3. Implement better monitoring - Set up CloudWatch alerts for cost tracking
  4. Consider local development - For the augmentation phase, a local machine might have been more cost-effective
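
To illustrate the spot-instance idea, a spot request can be made by adding `InstanceMarketOptions` to a normal `run_instances` call in boto3. The AMI ID, key pair name, and price cap below are placeholders, not values from my project.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request a g4dn.2xlarge as a spot instance instead of on-demand.
# ImageId and KeyName are placeholders -- substitute your own.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # e.g. a Deep Learning AMI
    InstanceType="g4dn.2xlarge",
    KeyName="my-key-pair",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "0.70",  # cap below the ~$1.07 on-demand rate
            "SpotInstanceType": "one-time",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```

Since spot capacity can be reclaimed at any time, training on spot would also need regular checkpointing (for example, the ModelCheckpoint callback sketched earlier) so an interruption doesn't throw away hours of work.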